HyperMinHash: Jaccard index sketching in LogLog space

Authors

  • Y. William Yu
  • Griffin Weber
Abstract

In this extended abstract, we describe and analyse a streaming probabilistic sketch, HyperMinHash, to estimate the Jaccard index (or Jaccard similarity coefficient) over two sets A and B. HyperMinHash can be thought of as a compression of standard $\log n$-space MinHash that builds on a HyperLogLog count-distinct sketch. For a multiplicative approximation error $1+\epsilon$ on a Jaccard index $t$, given a random oracle, HyperMinHash needs $O\left(\epsilon^{-2}\left(\log\log n + \log\frac{1}{t\epsilon}\right)\right)$ space. Unlike comparable Jaccard index fingerprinting algorithms (such as b-bit MinHash, which uses less space), HyperMinHash retains MinHash’s features of streaming updates, unions, and cardinality estimation. Our new algorithm allows estimating Jaccard indices of 0.01 for set cardinalities on the order of $10^{19}$ with relative error of around 10% using 64KiB of memory; MinHash can only estimate Jaccard indices for cardinalities of $10^{10}$ with the same memory consumption. Note that we operate in the unbounded data stream model and assume both a random oracle and shared randomness.
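The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `P`, `EXP_BITS`, `R`, `new_sketch`, `add`, `union`, `jaccard`, and `sketch_of` are hypothetical, SHA-256 stands in for the random oracle, and the estimator simply counts matching buckets without the collision correction or cardinality estimation a full HyperMinHash would need. Each bucket keeps a HyperLogLog-style leading-one position plus a few extra bits of the hash, rather than the full minimum hash that MinHash would store.

```python
import hashlib

P = 10          # 2**P buckets (illustrative choice, not from the paper)
EXP_BITS = 63   # hash bits scanned for the position of the leading 1
R = 10          # extra "mantissa" bits kept per bucket (illustrative choice)


def new_sketch():
    """Return an empty sketch: one register per bucket, None = nothing seen."""
    return [None] * (1 << P)


def _register_for(item):
    """Map a string to (bucket, key); a smaller key stands for a smaller hash."""
    h = int.from_bytes(hashlib.sha256(item.encode("utf-8")).digest()[:16], "big")
    bucket = h >> (128 - P)                                   # top P bits
    field = (h >> (128 - P - EXP_BITS)) & ((1 << EXP_BITS) - 1)
    exponent = EXP_BITS - field.bit_length() + 1              # leading zeros + 1
    mantissa = (h >> (128 - P - EXP_BITS - R)) & ((1 << R) - 1)
    # A larger exponent means a smaller hash, so negate it for ordering.
    return bucket, (-exponent, mantissa)


def add(sketch, item):
    """Streaming update: keep the smallest key seen in the item's bucket."""
    bucket, key = _register_for(item)
    if sketch[bucket] is None or key < sketch[bucket]:
        sketch[bucket] = key


def union(a, b):
    """Sketch of the set union: bucket-wise minimum, ignoring empty registers."""
    out = []
    for x, y in zip(a, b):
        if x is None:
            out.append(y)
        elif y is None:
            out.append(x)
        else:
            out.append(min(x, y))
    return out


def jaccard(a, b):
    """Fraction of non-empty union buckets whose registers agree (uncorrected)."""
    merged = union(a, b)
    non_empty = sum(1 for r in merged if r is not None)
    matches = sum(1 for x, y in zip(a, b) if x is not None and x == y)
    return matches / non_empty if non_empty else 0.0


def sketch_of(items):
    """Convenience helper: build a sketch from an iterable of strings."""
    s = new_sketch()
    for item in items:
        add(s, item)
    return s


if __name__ == "__main__":
    a = sketch_of(str(i) for i in range(10_000))
    b = sketch_of(str(i) for i in range(5_000, 15_000))
    print("estimated Jaccard index:", jaccard(a, b))  # true value is 1/3
```

Under these assumptions each register holds a counter of at most 64 plus `R` bits, which is where the saving over storing full hash values comes from, while streaming updates and unions still work as they do for MinHash.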

Similar resources

Assessing the Use of Similarity Distance Measurement in Shape Recognition

Distance measurement is one of the techniques widely used to measure the similarity between two feature matrices of objects. The objective of this paper is to explore research on applied distance measures in shape-based recognition. In distance measure computation, patterns that are similar will have a small distance, while uncorrelated patterns in the feature space will be far apart...

Consistent Weighted Sampling Made Fast, Small, and Easy

Document sketching using Jaccard similarity has been an effective technique for reducing near-duplicates in Web page and image search results, and has also proven useful in file system synchronization, compression and learning applications [6, 4, 5]. Min-wise sampling can be used to derive an unbiased estimator for Jaccard similarity and taking a few hundred independent consistent samples...

LogLog-Beta and More: A New Algorithm for Cardinality Estimation Based on LogLog Counting

The information presented in this paper defines LogLog-Beta (LogLog-β). LogLog-β is a new algorithm for estimating cardinalities based on LogLog counting. The new algorithm uses only one formula and needs no additional bias corrections for the entire range of cardinalities; therefore, it is more efficient and simpler to implement. Our simulations show that the accuracy provided by the new al...

Current Discussions on Digital Sketching in the Early Stages of Architectural Design in Education

In architectural design, designers focus on the early stages of the design process, or conceptual design. The ultimate goal of this stage is to find a solution for an existing problem, investigate the design space, or explore an idea. This stage conventionally begins with sketches and diagrams to explore ideas and solutions; the ambiguity and vagueness of conventional freehand sketching ca...

Unilateral Jaccard Similarity Coefficient

Similarity measures are essential to solve many pattern recognition problems such as classification, clustering, and retrieval problems. Various similarity measures are categorized in both syntactic and semantic relationships. In this paper we present a novel similarity, Unilateral Jaccard Similarity Coefficient (uJaccard), which doesn’t only take into consideration the space among two points b...

Journal title:
  • CoRR

Volume abs/1710.08436

Publication date 2017